1 Climate change and temperature anomalies

If we wanted to study climate change, we can find data on the Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies in the Northern Hemisphere at NASA’s Goddard Institute for Space Studies. The tabular data of temperature anomalies can be found here

To define temperature anomalies you need to have a reference, or base, period which NASA clearly states that it is the period between 1951-1980.

Run the code below to load the file:

weather <- 
  read_csv("https://data.giss.nasa.gov/gistemp/tabledata_v4/NH.Ts+dSST.csv", 
           skip = 1, 
           na = "***")

Notice that, when using this function, we added two options: skip and na.

  1. The skip=1 option is there as the real data table only starts in Row 2, so we need to skip one row.
  2. na = "***" option informs R how missing observations in the spreadsheet are coded. When looking at the spreadsheet, you can see that missing data is coded as "***". It is best to specify this here, as otherwise some of the data is not recognized as numeric data.

Once the data is loaded, notice that there is a object titled weather in the Environment panel. If you cannot see the panel (usually on the top-right), go to Tools > Global Options > Pane Layout and tick the checkbox next to Environment. Click on the weather object, and the dataframe will pop up on a seperate tab. Inspect the dataframe.

For each month and year, the dataframe shows the deviation of temperature from the normal (expected). Further the dataframe is in wide format.

You have two objectives in this section:

  1. Select the year and the twelve month variables from the weather dataset. We do not need the others (J-D, D-N, DJF, etc.) for this assignment. Hint: use select() function.

  2. Convert the dataframe from wide to ‘long’ format. Hint: use gather() or pivot_longer() function. Name the new dataframe as tidyweather, name the variable containing the name of the month as month, and the temperature deviation values as delta.

weather_12m <- weather%>%
  clean_names()%>%
  select(year, jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec)

tidyweather <- weather_12m%>%
  pivot_longer(cols = 2:13,
               names_to = "month",
               values_to = "delta")

head(tidyweather)
## # A tibble: 6 × 3
##    year month delta
##   <dbl> <chr> <dbl>
## 1  1880 jan   -0.35
## 2  1880 feb   -0.5 
## 3  1880 mar   -0.23
## 4  1880 apr   -0.29
## 5  1880 may   -0.05
## 6  1880 jun   -0.15

Inspect your dataframe. It should have three variables now, one each for

  1. year,
  2. month, and
  3. delta, or temperature deviation.

1.1 Plotting Information

Let us plot the data using a time-series scatter plot, and add a trendline. To do that, we first need to create a new variable called date in order to ensure that the delta values are plot chronologically.

In the following chunk of code, I used the eval=FALSE argument, which does not run a chunk of code; I did so that you can knit the document before tidying the data and creating a new dataframe tidyweather. When you actually want to run this code and knit your document, you must delete eval=FALSE, not just here but in all chunks were eval=FALSE appears.

tidyweather <- tidyweather %>%
  mutate(date = ymd(paste(as.character(year), month, "1")),
         month = month(date, label=TRUE),
         year = year(date))

ggplot(tidyweather, aes(x=date, y = delta))+
  geom_point()+
  geom_smooth(color="red") +
  theme_bw() +
  labs (
    title = "Weather Anomalies"
  )

Is the effect of increasing temperature more pronounced in some months? Use facet_wrap() to produce a seperate scatter plot for each month, again with a smoothing line. Your chart should human-readable labels; that is, each month should be labeled “Jan”, “Feb”, “Mar” (full or abbreviated month names are fine), not 1, 2, 3.

>TEAM ANSWER>

After breaking the data down into months it looks like for the most part the affect of increasing temperature is similar accross the board.

It is sometimes useful to group data into different time periods to study historical data. For example, we often refer to decades such as 1970s, 1980s, 1990s etc. to refer to a period of time. NASA calcuialtes a temperature anomaly, as difference form the base period of 1951-1980. The code below creates a new data frame called comparison that groups data in five time periods: 1881-1920, 1921-1950, 1951-1980, 1981-2010 and 2011-present.

We remove data before 1800 and before using filter. Then, we use the mutate function to create a new variable interval which contains information on which period each observation belongs to. We can assign the different periods using case_when().

comparison <- tidyweather %>% 
  filter(year>= 1881) %>%     #remove years prior to 1881
  #create new variable 'interval', and assign values based on criteria below:
  mutate(interval = case_when(
    year %in% c(1881:1920) ~ "1881-1920",
    year %in% c(1921:1950) ~ "1921-1950",
    year %in% c(1951:1980) ~ "1951-1980",
    year %in% c(1981:2010) ~ "1981-2010",
    TRUE ~ "2011-present"
  ))

head(comparison)
## # A tibble: 6 × 5
##    year month delta date       interval 
##   <dbl> <ord> <dbl> <date>     <chr>    
## 1  1881 Jan   -0.31 1881-01-01 1881-1920
## 2  1881 Feb   -0.22 1881-02-01 1881-1920
## 3  1881 Mar   -0.04 1881-03-01 1881-1920
## 4  1881 Apr    0    1881-04-01 1881-1920
## 5  1881 May    0.04 1881-05-01 1881-1920
## 6  1881 Jun   -0.32 1881-06-01 1881-1920

Inspect the comparison dataframe by clicking on it in the Environment pane.

Now that we have the interval variable, we can create a density plot to study the distribution of monthly deviations (delta), grouped by the different time periods we are interested in. Set fill to interval to group and colour the data by different time periods.

ggplot(comparison, aes(x=delta, fill=interval))+
  geom_density(alpha=0.2) +   #density plot with tranparency set to 20%
  theme_bw() +                #theme
  labs (
    title = "Density Plot for Monthly Temperature Anomalies",
    y     = "Density"         #changing y-axis label to sentence case
  )

So far, we have been working with monthly anomalies. However, we might be interested in average annual anomalies. We can do this by using group_by() and summarise(), followed by a scatter plot to display the result.

#creating yearly averages
average_annual_anomaly <- tidyweather %>% 
  group_by(year) %>%   #grouping data by Year
  
  # creating summaries for mean delta 
  # use `na.rm=TRUE` to eliminate NA (not available) values 
  summarise(mean_delta = mean(delta,na.rm = TRUE)
            ) 

#plotting the data:
ggplot(average_annual_anomaly, aes(x=year, y= mean_delta))+
  geom_point()+
  
  #Fit the best fit line, using LOESS method
  geom_smooth() +
  
  #change to theme_bw() to have white background + black frame around plot
  theme_bw() +
  labs (
    title = "Average Yearly Anomaly",
    y     = "Average Annual Delta"
  )                         

1.2 Confidence Interval for delta

NASA points out on their website that

A one-degree global change is significant because it takes a vast amount of heat to warm all the oceans, atmosphere, and land by that much. In the past, a one- to two-degree drop was all it took to plunge the Earth into the Little Ice Age.

Your task is to construct a confidence interval for the average annual delta since 2011, both using a formula and using a bootstrap simulation with the infer package. Recall that the dataframe comparison has already grouped temperature anomalies according to time intervals; we are only interested in what is happening between 2011-present.

formula_ci <- comparison %>%
  filter(interval == "2011-present")%>%
  group_by(interval)%>%
  summarise(mean_delta = mean(delta, na.rm=TRUE),
            sd_delta = sd(delta,na.rm=TRUE),
            count_delta = n(),
            se_delta = sd_delta/ sqrt(count_delta),
            t_critical = qt(0.975, count_delta - 1 ),
            lower = mean_delta - t_critical * se_delta,
            upper = mean_delta + t_critical * se_delta)

  # choose the interval 2011-present
  # what dplyr verb will you use? 

  # calculate summary statistics for temperature deviation (delta) 
  # calculate mean, SD, count, SE, lower/upper 95% CI
  # what dplyr verb will you use? 

#print out formula_CI
formula_ci
## # A tibble: 1 × 8
##   interval     mean_delta sd_delta count_delta se_delta t_critical lower upper
##   <chr>             <dbl>    <dbl>       <int>    <dbl>      <dbl> <dbl> <dbl>
## 1 2011-present       1.06    0.276         132   0.0240       1.98  1.01  1.11
library(infer)
library(tidyverse)

set.seed(1234)

# use the infer package to construct a 95% CI for delta

# bootstrap for MEAN rent
boot_comparison <- comparison %>%
  # Select 2-bedroom flat
  filter(interval == "2011-present") %>%
  
  # Specify the variable of interest
  specify(response = delta) %>%
  
  # Generate a bunch of bootstrap samples
  generate(reps = 1000, type = "bootstrap") %>%
  
  # Find the mean of each sample
  calculate(stat = "mean")

percentile_ci <- boot_comparison %>%
  get_ci(level = 0.95, type = "percentile")



visualize(boot_comparison) + 
  shade_ci(endpoints = percentile_ci,fill = "khaki")+
  geom_vline(xintercept = formula_ci$lower, colour = "red")+
  geom_vline(xintercept = formula_ci$upper, colour = "red")

What is the data showing us? Please type your answer after (and outside!) this blockquote. You have to explain what you have done, and the interpretation of the result. One paragraph max, please!

TEAM ANSWER

To create this visual we first used boot strapping to get a random sample of deltas from our dataset. After the bootstrap sample was created we used the sample to calculate a 95% confidence interval for the mean of deltas. On the graph the green lines show our bootstrapped confidence interval and tell us that we are 95% confident that the true mean delta is between about 1.016 and 1.12, while the red line (1.013; 1.108) show the calculated confidence interval from the sample.

2 General Social Survey (GSS)

The General Social Survey (GSS) gathers data on American society in order to monitor and explain trends in attitudes, behaviours, and attributes. Many trends have been tracked for decades, so one can see the evolution of attitudes, etc in American Society.

In this assignment we analyze data from the 2016 GSS sample data, using it to estimate values of population parameters of interest about US adults. The GSS sample data file has 2867 observations of 935 variables, but we are only interested in very few of these variables and you are using a smaller file.

gss <- read_csv(here::here("data", "smallgss2016.csv"), 
                na = c("", "Don't know",
                       "No answer", "Not applicable"))

You will also notice that many responses should not be taken into consideration, like “No Answer”, “Don’t Know”, “Not applicable”, “Refused to Answer”.

We will be creating 95% confidence intervals for population parameters. The variables we have are the following:

  • hours and minutes spent on email weekly. The responses to these questions are recorded in the emailhr and emailmin variables. For example, if the response is 2.50 hours, this would be recorded as emailhr = 2 and emailmin = 30.
  • snapchat, instagrm, twitter: whether respondents used these social media in 2016
  • sex: Female - Male
  • degree: highest education level attained

2.1 Instagram and Snapchat, by sex

Can we estimate the population proportion of Snapchat or Instagram users in 2016?

  1. Create a new variable, snap_insta that is Yes if the respondent reported using any of Snapchat (snapchat) or Instagram (instagrm), and No if not. If the recorded value was NA for both of these questions, the value in your new variable should also be NA.
  2. Calculate the proportion of Yes’s for snap_insta among those who answered the question, i.e. excluding NAs.
  3. Using the CI formula for proportions, please construct 95% CIs for men and women who used either Snapchat or Instagram
gss_numeric <- gss%>%
  mutate(instagrm = case_when(instagrm == "Yes" ~ 1,
                              instagrm == "No" ~ 0,
                              instagrm == NA ~ as.numeric(NA)),
         snapchat = case_when(snapchat == "Yes" ~ 1,
                              snapchat == "No" ~ 0,
                              snapchat == NA ~ as.numeric(NA)))%>%
  mutate(snap_insta = case_when(
    (instagrm == 1 | snapchat == 1) ~ 1,
    (snapchat == 0 & instagrm == 0) ~ 0,
    (snapchat == NA & instagrm == NA) ~ as.numeric(NA)
         ))


snap_insta_yes <- gss_numeric%>%
  filter(snap_insta == 1)%>%
  count()
  
snap_insta_response <- gss_numeric%>%
  filter(snap_insta == 0 | snap_insta == 1)%>%
  count()

sd_snap_insta <- sd(gss_numeric$snap_insta, na.rm=TRUE)

se_snap_insta <- sd_snap_insta/sqrt(snap_insta_response)

t_critical <- qt(0.975, snap_insta_response$n - 1)

snap_lower = mean(gss_numeric$snap_insta, na.rm=TRUE) - t_critical*se_snap_insta

snap_upper = mean(gss_numeric$snap_insta, na.rm=TRUE)  + t_critical*se_snap_insta

snap_insta_ci <- data.frame(snap_insta_response, sd_snap_insta, se_snap_insta, t_critical, snap_lower, snap_upper)
colnames(snap_insta_ci)<-c("number_of_responses","sd_snap_insta","se_snap_insta","t_critical","lower","upper")

snap_insta_ci
##   number_of_responses sd_snap_insta se_snap_insta t_critical lower upper
## 1                1372         0.484        0.0131       1.96 0.349   0.4

2.2 Twitter, by education level

Can we estimate the population proportion of Twitter users by education level in 2016?.

There are 5 education levels in variable degree which, in ascneding order of years of education, are Lt high school, High School, Junior college, Bachelor, Graduate.

  1. Turn degree from a character variable into a factor variable. Make sure the order is the correct one and that levels are not sorted alphabetically which is what R by default does.
  2. Create a new variable, bachelor_graduate that is Yes if the respondent has either a Bachelor or Graduate degree. As before, if the recorded value for either was NA, the value in your new variable should also be NA.
  3. Calculate the proportion of bachelor_graduate who do (Yes) and who don’t (No) use twitter.
  4. Using the CI formula for proportions, please construct two 95% CIs for bachelor_graduate vs whether they use (Yes) and don’t (No) use twitter.
  5. Do these two Confidence Intervals overlap?
degree <- factor(gss$degree,levels=c("Lt high school", "High School", "Junior college", "Bachelor", "Graduate"))

gss_twit <- gss %>% 
  mutate(bachelor_graduate = case_when(degree=="Bachelor"~ "Yes",
                                       degree=="Graduate" ~ "Yes",
                                       degree==NA ~ as.character(NA),
                                       TRUE ~ "No"))

yes_twit <- gss_twit %>% 
  filter(twitter=="Yes",bachelor_graduate=="Yes") %>% 
  summarise(count=n())
  
total_twit <- gss_twit %>% 
  filter(!is.na(bachelor_graduate)) %>% 
  summarise(count=n())

prop_twit <- yes_twit/total_twit

se_twit <- sqrt((prop_twit*(1-prop_twit))/total_twit)

t_critical = qt(0.975, total_twit$count-1)

twit_lower <- prop_twit - t_critical*se_twit
twit_upper <- prop_twit + t_critical*se_twit

twitter_ci <- data.frame(total_twit, se_twit, t_critical, twit_lower, twit_upper)
colnames(twitter_ci)<-c("total_twitter_responses","se_twitter","t_critical","twitter_lower","twitter_upper")

print(snap_insta_ci)
##   number_of_responses sd_snap_insta se_snap_insta t_critical lower upper
## 1                1372         0.484        0.0131       1.96 0.349   0.4
print(twitter_ci)
##   total_twitter_responses se_twitter t_critical twitter_lower twitter_upper
## 1                    2867    0.00365       1.96        0.0326        0.0469

TEAM ANSWER

The two confidence intervals do not overlap which tells us that we have statistically significant evidence to conclude there is a difference in mean proportion between snapchat/instagram users and twitter users that are bachelor graduates in 2016. If the two intervals overlapped there could still be significant evidence to conclude a difference but we would need to construct a hypothesis test or a combined confidence interval to be sure.

2.3 Email usage

Can we estimate the population parameter on time spent on email weekly?

  1. Create a new variable called email that combines emailhr and emailmin to reports the number of minutes the respondents spend on email weekly.
  2. Visualise the distribution of this new variable. Find the mean and the median number of minutes respondents spend on email weekly. Is the mean or the median a better measure of the typical amoung of time Americans spend on email weekly? Why?
  3. Using the infer package, calculate a 95% bootstrap confidence interval for the mean amount of time Americans spend on email weekly. Interpret this interval in context of the data, reporting its endpoints in “humanized” units (e.g. instead of 108 minutes, report 1 hr and 8 minutes). If you get a result that seems a bit odd, discuss why you think this might be the case.
  4. Would you expect a 99% confidence interval to be wider or narrower than the interval you calculated above? Explain your reasoning.
gss_email <- gss%>%
  mutate(email = as.numeric(emailhr)*60 + as.numeric(emailmin))

ggplot(gss_email, aes(email))+
  geom_density()+
  theme_bw() +
  labs (
    title = "Density Plot of Time Spent on Email Weekly"
  )                         

mean(gss_email$email, na.rm=TRUE)
## [1] 417
median(gss_email$email, na.rm=TRUE) #use median because the data is very right skewed
## [1] 120
email_ci <- gss_email%>%
  filter(!is.na(email))%>%
  summarise(mean_email = mean(email, na.rm=TRUE),
            sd_email = sd(email, na.rm = TRUE),
            count_email = count(email),
            se_email = sd_email/sqrt(count_email),
            t_critical = qt(0.975, count_email-1),
            lower = mean_email - t_critical*se_email,
            upper = mean_email + t_critical* se_email)
set.seed(1234)

# use the infer package to construct a 95% CI for delta


boot_email_usage <- gss_email %>%
  
  filter(!is.na(email))%>%
  
  # Specify the variable of interest
  specify(response = email) %>%
  
  # Generate a bunch of bootstrap samples
  generate(reps = 1000, type = "bootstrap") %>%
  
  # Find the mean of each sample
  calculate(stat = "mean")

percentile_ci <- boot_email_usage %>%
  get_ci(level = 0.95, type = "percentile")



visualize(boot_email_usage) + 
  shade_ci(endpoints = percentile_ci,fill = "khaki")+
  geom_vline(xintercept = email_ci$lower, colour = "red")+
  geom_vline(xintercept = email_ci$upper, colour = "red")

> TEAM ANSWER

If we were to construct a 99% confidence interval instead of the 95% interval above, we would expect the interval to be wider. This is because if we change nothing but our confidence level we will need to widen the interval in order to be more confident the true mean lies within.

3 Biden’s Approval Margins

As we saw in class, fivethirtyeight.com has detailed data on all polls that track the president’s approval

# Import approval polls data directly off fivethirtyeight website
approval_polllist <- read_csv('https://projects.fivethirtyeight.com/biden-approval-data/approval_polllist.csv') 

glimpse(approval_polllist)
## Rows: 1,589
## Columns: 22
## $ president           <chr> "Joseph R. Biden Jr.", "Joseph R. Biden Jr.", "Jos…
## $ subgroup            <chr> "All polls", "All polls", "All polls", "All polls"…
## $ modeldate           <chr> "9/8/2021", "9/8/2021", "9/8/2021", "9/8/2021", "9…
## $ startdate           <chr> "1/20/2021", "1/20/2021", "1/22/2021", "1/21/2021"…
## $ enddate             <chr> "1/24/2021", "1/26/2021", "1/24/2021", "1/25/2021"…
## $ pollster            <chr> "Rasmussen Reports/Pulse Opinion Research", "Echel…
## $ grade               <chr> "B", "B/C", "B", "B", "A", "B", "B", "B", "B", "B"…
## $ samplesize          <dbl> 1500, 1006, 15000, 1500, 809, 1164, 1990, 15000, 2…
## $ population          <chr> "lv", "rv", "a", "lv", "a", "lv", "rv", "a", "a", …
## $ weight              <dbl> 0.1925, 1.1880, 0.2585, 0.5373, 1.5308, 1.4552, 0.…
## $ influence           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ approve             <dbl> 48.0, 55.0, 53.0, 48.0, 54.0, 57.0, 59.0, 53.0, 56…
## $ disapprove          <dbl> 47.0, 33.0, 29.0, 47.0, 30.0, 36.0, 32.0, 30.0, 32…
## $ adjusted_approve    <dbl> 50.5, 54.6, 51.5, 50.5, 54.8, 55.8, 57.5, 51.5, 54…
## $ adjusted_disapprove <dbl> 40.8, 32.3, 32.3, 40.8, 29.8, 35.4, 35.3, 33.3, 35…
## $ multiversions       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ tracking            <lgl> TRUE, NA, TRUE, TRUE, NA, NA, NA, TRUE, NA, TRUE, …
## $ url                 <chr> "https://www.rasmussenreports.com/public_content/p…
## $ poll_id             <dbl> 74249, 74329, 74257, 74259, 74263, 74271, 74264, 7…
## $ question_id         <dbl> 139406, 139616, 139494, 139428, 139456, 139490, 13…
## $ createddate         <chr> "1/25/2021", "2/3/2021", "1/28/2021", "1/26/2021",…
## $ timestamp           <chr> "09:25:08  8 Sep 2021", "09:25:08  8 Sep 2021", "0…
# Use `lubridate` to fix dates, as they are given as characters.

approval_polllist <- approval_polllist%>%
  filter(subgroup == "All polls")%>%
  mutate(enddate = mdy(enddate),
         startdate = mdy(startdate),
         modeldate = mdy(modeldate),
         week = isoweek(enddate))
  # filter(startdate >= dmy("20/01/2021"))

head(approval_polllist)
## # A tibble: 6 × 23
##   president  subgroup modeldate  startdate  enddate    pollster grade samplesize
##   <chr>      <chr>    <date>     <date>     <date>     <chr>    <chr>      <dbl>
## 1 Joseph R.… All pol… 2021-09-08 2021-01-20 2021-01-24 Rasmuss… B           1500
## 2 Joseph R.… All pol… 2021-09-08 2021-01-20 2021-01-26 Echelon… B/C         1006
## 3 Joseph R.… All pol… 2021-09-08 2021-01-22 2021-01-24 Morning… B          15000
## 4 Joseph R.… All pol… 2021-09-08 2021-01-21 2021-01-25 Rasmuss… B           1500
## 5 Joseph R.… All pol… 2021-09-08 2021-01-21 2021-01-24 Monmout… A            809
## 6 Joseph R.… All pol… 2021-09-08 2021-01-22 2021-01-25 Data fo… B           1164
## # … with 15 more variables: population <chr>, weight <dbl>, influence <dbl>,
## #   approve <dbl>, disapprove <dbl>, adjusted_approve <dbl>,
## #   adjusted_disapprove <dbl>, multiversions <chr>, tracking <lgl>, url <chr>,
## #   poll_id <dbl>, question_id <dbl>, createddate <chr>, timestamp <chr>,
## #   week <dbl>

3.1 Create a plot

What I would like you to do is to calculate the average net approval rate (approve- disapprove) for each week since he got into office. I want you plot the net approval, along with its 95% confidence interval. There are various dates given for each poll, please use enddate, i.e., the date the poll ended.

Also, please add an orange line at zero. Your plot should look like this:

#Calculate average net approval rate
library(Hmisc)

approval_polllist_net <- approval_polllist %>%
  filter(!is.na(approve), !is.na(disapprove))%>%
  mutate(net_approve = approve - disapprove)%>%
  # mutate(net_approve = adjusted_approve - adjusted_disapprove)%>%
  group_by(week)%>%
  summarise(
            # mean_net_approve = weighted.mean(net_approve, weight),
            mean_net_approve = mean(net_approve, na.rm=TRUE),
            count_net_approve = n(),
            # sd_net_approve = sqrt(wtd.var(net_approve, weight)),
            sd_net_approve = sd(net_approve,na.rm=TRUE),
            se_net_approve = sd_net_approve/sqrt(count_net_approve),
            t_critical = qt(0.975, count_net_approve-1),
            lower = mean_net_approve - t_critical*se_net_approve,
            upper = mean_net_approve + t_critical*se_net_approve)
  
lower_upper=approval_polllist_net %>% select(week, lower, upper)


#Plot net approval with 95% CI

biden_graph <- ggplot(approval_polllist_net, aes(x=week, y=mean_net_approve))+
  geom_point(color="red") +
  geom_smooth(se=FALSE, color="blue", size=1)+
  geom_line(color="red", size=0.2)+
  geom_line(aes(x = week, y = lower, color="Orange"))+
  geom_line(aes(x = week, y = upper, color="Orange"))+
  theme_bw()+
  labs (
    title = "Estimating Appproval Margin (approve - disapprove) for Joe Biden",
    subtitle = "Weekly average of all polls 2021",
    x     = "Week of the Year",
    y     = "Average Approval Margin (Approve - Disapprove"
    ) +
    annotate("line", x = seq(-5,35), y = 0, lty = 1, color = "Orange", size =2)+
  xlim(0,35) +
  ylim(-20,50)+
  geom_ribbon(aes(ymin=lower, ymax=upper), alpha=0.2, color="lightgrey")+
  NULL

biden_graph + theme(legend.position="none")

3.2 Compare Confidence Intervals

Compare the confidence intervals for week 3 and week 25. Can you explain what’s going on? One paragraph would be enough.

TEAM ANSWER:

The confidance interval for week 3 is much wider than week 25 because the sample size of the polls is smaller during weeek 3 than 25 (n=3 vs n=18) resulting in a higher t critical value.

4 Gapminder revisited

Recall the gapminder data frame from the gapminder package. That data frame contains just six columns from the larger data in Gapminder World. In this part, you will join a few dataframes with more data than the ‘gapminder’ package. Specifically, you will look at data on

You must use the wbstats package to download data from the World Bank. The relevant World Bank indicators are SP.DYN.TFRT.IN, SE.PRM.NENR, NY.GDP.PCAP.KD, and SH.DYN.MORT

# load gapminder HIV data
hiv <- read_csv(here::here("data","adults_with_hiv_percent_age_15_49.csv"))
life_expectancy <- read_csv(here::here("data","life_expectancy_years.csv"))

# get World bank data using wbstats
indicators <- c("SP.DYN.TFRT.IN","SE.PRM.NENR", "SH.DYN.MORT", "NY.GDP.PCAP.KD")


library(wbstats)

worldbank_data <- wb_data(country="countries_only", #countries only- no aggregates like Latin America, Europe, etc.
                          indicator = indicators, 
                          start_date = 1960, 
                          end_date = 2016)

# get a dataframe of information regarding countries, indicators, sources, regions, indicator topics, lending types, income levels,  from the World Bank API 
countries <-  wbstats::wb_cachelist$countries
library(lubridate)

hiv_longer <- hiv %>%
  pivot_longer(cols = 2:34,
               names_to = "date",
               values_to = "hiv") %>%
  mutate(date=as.numeric(date))

life_expectancy_longer <- life_expectancy %>%
  pivot_longer(cols = 2:302,
               names_to = "date",
               values_to = "life_expectancy")%>%
  mutate(date=as.numeric(date))

region <- countries %>% 
  select(country,region)

hiv_longer <- hiv_longer%>%
  na.omit()

life_expectancy <- life_expectancy%>%
  na.omit()

worldbank_join <- worldbank_data %>%
  left_join(hiv_longer,by =c("date","country")) %>%
  left_join(life_expectancy_longer,by =c("date","country"))%>%
  left_join(region,by =c("country"))



# using left join to match the date and country from worldbank_data, because some of the yearly data in the hiv data and life expectancy data are not used for our analytics

You have to join the 3 dataframes (life_expectancy, worldbank_data, and HIV) into one. You may need to tidy your data first and then perform join operations. Think about what type makes the most sense and explain why you chose it.

  1. What is the relationship between HIV prevalence and life expectancy? Generate a scatterplot with a smoothing line to report your results. You may find faceting useful
worldbank_join %>%
  filter(!is.na(hiv), !is.na(life_expectancy), !is.na(region))%>%
  group_by(region)%>%
  filter(date == max(date))%>%
  ungroup(region)%>%
  ggplot(aes(x= life_expectancy,
             y= hiv))+
  geom_point()+
  geom_smooth(method = "lm")+
  facet_wrap(~region)+
  labs(title="Relationship between HIV Prevalence and Life Expectancy by Region",
       x="Life Expectancy",
       y="HIV prevalence")

> TEAM ANSWER:

In most regions HIV prevalance and life expectancy has a negative corellation, but in most regions apart from Sub-Saharan Africa HIV prevalence is so low there is not much correlation to be seen. While in the Sub_Saharan Africa the correlation is more prounounced as HIV is more prevalent.

  1. What is the relationship between fertility rate and GDP per capita? Generate a scatterplot with a smoothing line to report your results. You may find facetting by region useful
worldbank_join %>%
  filter(!is.na(SP.DYN.TFRT.IN)&!is.na(NY.GDP.PCAP.KD))%>%
  group_by(country)%>%
  summarise(date, country, region,SP.DYN.TFRT.IN = mean(SP.DYN.TFRT.IN),
            NY.GDP.PCAP.KD = mean(NY.GDP.PCAP.KD))%>%
  select(SP.DYN.TFRT.IN,NY.GDP.PCAP.KD, date, country, region)%>%
  ungroup(country)%>%
  ggplot(aes(x= NY.GDP.PCAP.KD,
             y= SP.DYN.TFRT.IN))+
  geom_point()+
  geom_smooth(method = "lm")+
  facet_wrap(~region)+
  labs(title="Relationship between Fertility Rate and GDP Per Capita by Region",
       x="GDP Per Capita",
       y="Fertility Rate")

> TEAM ANSWER

Looking at the scatterplots above, it does not look like there is a strong relationship between GDP Per Capita and Fertility Rate. Some regions have more of a relationship than others. For example, it looks like Latin American & Caribbean, Eat Asia & Pacific, and Middle East & North Africa displat the strongest negative correlation between the two variables. Sub-Saharan Africa and South Asia look to have little to no relationship.

  1. Which regions have the most observations with missing HIV data? Generate a bar chart (geom_col()), in descending order.
worldbank_join %>% 
  filter(is.na(hiv),!is.na(region)) %>% 
  group_by(region) %>% 
  summarise(count=n()) %>% 
  ggplot(aes(x=reorder(region, -count),
             y=count))+
  geom_col()+
  labs(title="Missing HIV Data by Region",
       x="Region",
       y="Observations")+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 20, hjust = 0.8))+
  NULL

> TEAM ANSWER

Europe and Central Asia has the most observations with missing hiv data.

  1. How has mortality rate for under 5 changed by region? In each region, find the top 5 countries that have seen the greatest improvement, as well as those 5 countries where mortality rates have had the least improvement or even deterioration.
worldbank_join %>% 
  filter(!is.na(SH.DYN.MORT)&!is.na(region)) %>% 
  group_by(date,region) %>% 
  summarise(mean=mean(SH.DYN.MORT)) %>% 
  ggplot(aes(x= date,
             y= mean))+
  geom_smooth()+
  facet_wrap(~region)+
  labs(title="Mortality Rate for Under 5 Changed by Region",
       x="Year",
       y="Mortality Rate for Under 5")

worldbank_join %>% 
  filter(!is.na(SH.DYN.MORT)&!is.na(region)) %>% 
  group_by(country) %>% 
  mutate(b5_mortality_end = SH.DYN.MORT)%>%
  filter(date == min(date))%>%
  mutate(b5_mortality_begin = SH.DYN.MORT)%>%
  summarise(b5_mortality = b5_mortality_end - b5_mortality_begin)
## # A tibble: 192 × 2
##    country             b5_mortality
##    <chr>                      <dbl>
##  1 Afghanistan                    0
##  2 Albania                        0
##  3 Algeria                        0
##  4 Andorra                        0
##  5 Angola                         0
##  6 Antigua and Barbuda            0
##  7 Argentina                      0
##  8 Armenia                        0
##  9 Australia                      0
## 10 Austria                        0
## # … with 182 more rows
top5_countryByRegion <- worldbank_join %>% 
  filter(!is.na(SH.DYN.MORT), !is.na(region)) %>%
  group_by(region,country)%>%
  mutate(max=max(date),min=min(date)) %>% 
  filter(date==max|date==min) %>%
  summarise(max_mort_change = diff(SH.DYN.MORT)) %>% 
  slice_min(order_by = max_mort_change, n=5)
  
bottom5_countryByRegion <- worldbank_join %>% 
  filter(!is.na(SH.DYN.MORT), !is.na(region)) %>%
  group_by(region,country)%>%
  mutate(max=max(date),min=min(date)) %>% 
  filter(date==max|date==min) %>%
  summarise(min_mort_change = diff(SH.DYN.MORT)) %>% 
  slice_max(order_by = min_mort_change, n=5)

TEAM ANSWER:

Mortality rate for under 5 has changed a lot in most regions since 1960. The regions with the most dramatic change are Sub-Saharan Africa, Middle Each & North Africa, and South Asia. Though Sub-Saharan Africa displays the largest change, the region still has the highest mortality rate today. No regions have increased their mortality rate overtime but North America and Eruope & Central Asia show very little change because in 1960 their mortality rate was already relatively low.

  1. Is there a relationship between primary school enrollment and fertility rate?
worldbank_join%>%
  filter(!is.na(SP.DYN.TFRT.IN), !is.na(SE.PRM.NENR))%>%
  group_by(country)%>%
  filter(date == max(date))%>%
  ungroup()%>%
  ggplot(aes(x = SP.DYN.TFRT.IN, y = SE.PRM.NENR))+
  geom_point()+
  geom_smooth(method = "lm")+
  labs(title = "Relationship between Fertility Rate and Primary School Enrollment",
       x= "Fertility Rate",
       y="Primary School Enrollment")

> TEAM ANSWER

The graph above shows a relatively weak negative correlation between fertility rate and Primary School Enrollment.

5 Challenge 1: Excess rentals in TfL bike sharing

Recall the TfL data on how many bikes were hired every single day. We can get the latest data by running the following

url <- "https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xlsx"

# Download TFL data to temporary file
httr::GET(url, write_disk(bike.temp <- tempfile(fileext = ".xlsx")))
## Response [https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/number-bicycle-hires/2021-08-23T14%3A32%3A29/tfl-daily-cycle-hires.xlsx?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20210909%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20210909T125326Z&X-Amz-Expires=300&X-Amz-Signature=67cc0adad27be71bca28d70f7db842004ffd37c12d470a98cc22bd8c02fe36f7&X-Amz-SignedHeaders=host]
##   Date: 2021-09-09 12:53
##   Status: 200
##   Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
##   Size: 173 kB
## <ON DISK>  /var/folders/h6/kmqd84j93_z506mgbtp4kqpm0000gn/T//Rtmp0XuKh8/file903771db9a12.xlsx
# Use read_excel to read it as dataframe
bike0 <- read_excel(bike.temp,
                   sheet = "Data",
                   range = cell_cols("A:B"))

# change dates to get year, month, and week
bike <- bike0 %>% 
  clean_names() %>% 
  rename (bikes_hired = number_of_bicycle_hires) %>% 
  mutate (year = year(day),
          month = lubridate::month(day, label = TRUE),
          week = isoweek(day))

We can easily create a facet grid that plots bikes hired by month and year.

Look at May and Jun and compare 2020 with the previous years. What’s happening?

However, the challenge I want you to work on is to reproduce the following two graphs.

expected_monthly <- bike%>%
  filter(day >= dmy("01/01/2016"), day<dmy("01/01/2020"))%>%
  group_by(month)%>%
  summarise(expected_avg = mean(bikes_hired))

monthly_rentals <- bike%>%
  filter(day >= dmy("01/01/2016"))%>%
  group_by(year,month) %>% 
  summarise(actual_avg=mean(bikes_hired)) %>% 
  left_join(expected_monthly, by = "month")

monthly_rentals %>% 
  ggplot(aes(x=as.numeric(month)))+
  geom_line(aes(y=expected_avg),color="blue")+
  geom_line(aes(y=actual_avg),color = "black")+
  geom_ribbon(aes(ymin=expected_avg, ymax=pmax(actual_avg,expected_avg)),fill="springgreen1", alpha = 0.3) +
  geom_ribbon(aes(ymin=pmin(actual_avg,expected_avg), ymax=expected_avg), fill="tomato", alpha = 0.3)+
  facet_wrap(~year)+
  theme_bw()+
  theme(legend.position = "none",
        strip.background = element_blank(),
        panel.border = element_blank(),
        plot.title = element_text(size = 9),
        plot.subtitle = element_text(size = 7),
        strip.text.x = element_text(size = 5),
        axis.text.y = element_text(size = 5),
        axis.text.x = element_text(size = 5))+
  scale_x_continuous(labels = function(x) month.abb[x])+
  labs(title = "Monthly change in Tfl bike rentals",
       subtitle = "Change from montly average shown in Blue and calculated between 2016-2019",
       x = "Month",
       y = "Bikes rentals")

The second one looks at percentage changes from the expected level of weekly rentals. The two grey shaded rectangles correspond to Q2 (weeks 14-26) and Q4 (weeks 40-52).

expected_weekly <- bike %>% 
  filter(day>=dmy("4/1/2016") & day<=dmy("29/12/2019")) %>% 
  group_by(week) %>% 
  summarise(expected_rentals=mean(bikes_hired))

weekly_rentals <- bike %>% 
  filter(day>dmy("4/1/2016")) %>% 
  group_by(year,week) %>%
  mutate(yearminusone = year - 1,
         year_week = ifelse(week==53 & month=="Jan",
                            paste(yearminusone,week,sep="-"),
                            paste(year,week,sep="-"))) %>%
  group_by(year_week) %>%
  mutate(actual_rentals = mean(bikes_hired)) %>% 
  filter(day==max(day)) %>%
  ungroup() %>%
  left_join(expected_weekly,by =c("week")) %>% 
  mutate(delta=(actual_rentals/expected_rentals- 1),
         delta = replace_na(delta, 1),
         month=ifelse(week==53,"Dec",month),
         year=ifelse(week==53,year-1,year)) %>% 
  add_row(year=2016,week=53,delta=0)

           

weekly_rentals %>% 
  ggplot(aes(x=week,
             y=delta))+
  geom_line(aes(y = delta)) +
  annotate("rect", xmin = 13, xmax = 26, ymin = -Inf, ymax = Inf, fill = "grey", alpha = 0.3)+
  annotate("rect", xmin = 39, xmax = 53, ymin = -Inf, ymax = Inf, fill = "grey", alpha = 0.3)+
  geom_ribbon(aes(ymin=0, ymax=pmax(0, delta), fill="#eab5b7", alpha = 0.3)) +
  geom_ribbon(aes(ymin=pmin(0, delta), ymax=0, fill="#c0e0c3", alpha = 0.3))+
  geom_rug(data=subset(weekly_rentals,delta>=0),color="#c0e0c3",sides="b")+
  geom_rug(data=subset(weekly_rentals,delta<0),color="#eab5b7",sides="b")+
  facet_wrap(~year)+
  scale_y_continuous(labels = scales::percent)+
  labs(title="Weekly changes in TfL bike rentals",
       subtitle="% change from weekly averages \ncalculated between 2016-2019",
       x="week",
       y="")+
  scale_x_continuous(breaks = c(13,26,39,53))+
  theme_bw()+
  theme(legend.position = "none",
        strip.background = element_blank(),
        panel.border = element_blank(),
        plot.title = element_text(size = 9),
        plot.subtitle = element_text(size = 7),
        strip.text.x = element_text(size = 5),
        axis.text.y = element_text(size = 5),
        axis.text.x = element_text(size = 5))

For both of these graphs, you have to calculate the expected number of rentals per week or month between 2016-2019 and then, see how each week/month of 2020-2021 compares to the expected rentals. Think of the calculation excess_rentals = actual_rentals - expected_rentals.

Should you use the mean or the median to calculate your expected rentals? Why?

In creating your plots, you may find these links useful:

6 Deliverables

As usual, there is a lot of explanatory text, comments, etc. You do not need these, so delete them and produce a stand-alone document that you could share with someone. Knit the edited and completed R Markdown file as an HTML document (use the “Knit” button at the top of the script editor window) and upload it to Canvas.

7 Details

  • Who did you collaborate with: Study Group 11 members
  • Approximately how much time did you spend on this problem set: 10 hours
  • What, if anything, gave you the most trouble: Monthly bike rentals graph, data manipulation for hiv and worldbank data.

Please seek out help when you need it, and remember the 15-minute rule. You know enough R (and have enough examples of code from class and your readings) to be able to do this. If you get stuck, ask for help from others, post a question on Slack– and remember that I am here to help too!

As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else?

8 Rubric

Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.

Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).

Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.